Optimization for batched processing #448

Merged: 14 commits into main from opt/outer_batched, Oct 18, 2024

Conversation

HYLcool (Collaborator) commented Oct 15, 2024

  • Extract the batched processing loop into a new function (e.g. compute_stats_batched), and select which function to call when processing data.
  • Pros:
    • There is no need to modify each OP to support batched processing; setting the arg _batched_op to True on the OP is enough.
    • Developers can write new OPs against a single sample dict, which is friendlier.
  • Cons (for now):
    • Data processing is slightly slower. Four scenarios were tested, with results shown below:
| Scenario | OPs | Before | After |
| --- | --- | --- | --- |
| Simple Filter | text_length_filter | 4.850s | 5.340s (+10.10%) |
| Simple Mapper | whitespace_normalization_mapper | 12.145s | 12.575s (+3.54%) |
| With context (OP fusion) | word_num_filter & word_repetition_filter | 80.456s | 87.315s (+8.53%) |
| With rank (Model-based OPs) | image_text_similarity_filter | 562.586s | 590.502s (+4.96%) |
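
For readers unfamiliar with the pattern, the idea described above can be sketched roughly as follows. This is only an illustration and not the code in this PR: the names compute_stats_batched and _batched_op come from the description, while SketchFilter, compute_stats_single, process, TextLengthSketch, and the dict-of-lists batch layout are assumptions made for the example.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]        # one record: key -> value
Batch = Dict[str, List[Any]]   # assumed batch layout: key -> list of values


class SketchFilter:
    """Hypothetical base class; not Data-Juicer's actual implementation."""

    # Flag named in the PR description; an OP opts in by setting it to True.
    _batched_op = False

    def compute_stats_single(self, sample: Sample) -> Sample:
        # OP authors implement only this single-sample method.
        raise NotImplementedError

    def compute_stats_batched(self, batch: Batch) -> Batch:
        # The batched loop lives once in the base class: split the batch into
        # per-sample dicts, call the single-sample method, then re-assemble.
        keys = list(batch.keys())
        num_samples = len(batch[keys[0]]) if keys else 0
        results = [
            self.compute_stats_single({k: batch[k][i] for k in keys})
            for i in range(num_samples)
        ]
        if not results:
            return batch
        return {k: [r[k] for r in results] for k in results[0]}

    def process(self, data):
        # Choose the branch to call based on the flag.
        if self._batched_op:
            return self.compute_stats_batched(data)
        return self.compute_stats_single(data)


class TextLengthSketch(SketchFilter):
    """Hypothetical OP written in the single-sample-dict style."""

    _batched_op = True

    def compute_stats_single(self, sample: Sample) -> Sample:
        # Return a new dict rather than mutating the input sample.
        return {**sample, "text_len": len(sample["text"])}


op = TextLengthSketch()
result = op.process({"text": ["ab", "hello"]})
print(result)
```

In this sketch the OP author never touches batch handling; switching an OP between the two paths only requires changing the class-level flag.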

data_juicer/ops/base_op.py: 4 review threads (outdated, resolved)
drcege added the enhancement and dj:op labels Oct 15, 2024
HYLcool marked this pull request as ready for review October 17, 2024 02:32
HYLcool self-assigned this Oct 17, 2024
HYLcool changed the title from "[WIP] Optimization for batched processing" to "Optimization for batched processing" Oct 17, 2024
data_juicer/ops/base_op.py: review thread (resolved)
docs/DeveloperGuide_ZH.md: review thread (resolved)
drcege (Collaborator) commented Oct 18, 2024

Please sync the main branch.

drcege (Collaborator) left a comment

LGTM

HYLcool merged commit 4384bfa into main Oct 18, 2024
3 checks passed
HYLcool deleted the opt/outer_batched branch October 18, 2024 08:28
Labels: dj:op, enhancement